89 research outputs found
Inferring Program Transformations from Type Transformations for Partitioning of Ordered Sets
In this paper I introduce a mechanism to derive program transforma- tions
from order-preserving transformations of vector types. The purpose of this work
is to allow automatic generation of correct-by-construction instances of
programs in a streaming data processing paradigm suitable for FPGA processing.
We show that for it is possible to automatically derive instances for programs
based on combinations of opaque element- processing functions combined using
foldl and map, purely from the type transformations.Comment: This work is supported by the EPSRC through the TyTra project
(EP/L00058X/1
Model Coupling between the Weather Research and Forecasting Model and the DPRI Large Eddy Simulator for Urban Flows on GPU-accelerated Multicore Systems
In this report we present a novel approach to model coupling for
shared-memory multicore systems hosting OpenCL-compliant accelerators, which we
call The Glasgow Model Coupling Framework (GMCF). We discuss the implementation
of a prototype of GMCF and its application to coupling the Weather Research and
Forecasting Model and an OpenCL-accelerated version of the Large Eddy Simulator
for Urban Flows (LES) developed at DPRI.
The first stage of this work concerned the OpenCL port of the LES. The
methodology used for the OpenCL port is a combination of automated analysis and
code generation and rule-based manual parallelization. For the evaluation, the
non-OpenCL LES code was compiled using gfortran, fort and pgfortran}, in each
case with auto-parallelization and auto-vectorization. The OpenCL-accelerated
version of the LES achieves a 7 times speed-up on a NVIDIA GeForce GTX 480
GPGPU, compared to the fastest possible compilation of the original code
running on a 12-core Intel Xeon E5-2640.
In the second stage of this work, we built the Glasgow Model Coupling
Framework and successfully used it to couple an OpenMP-parallelized WRF
instance with an OpenCL LES instance which runs the LES code on the GPGPI. The
system requires only very minimal changes to the original code. The report
discusses the rationale, aims, approach and implementation details of this
work.Comment: This work was conducted during a research visit at the Disaster
Prevention Research Institute of Kyoto University, supported by an EPSRC
Overseas Travel Grant, EP/L026201/
Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent
a powerful and affordable tool for scientists who look to speed up simulations
of complex systems. However, porting code to such devices requires a detailed
understanding of heterogeneous programming tools and effective strategies for
parallelization. In this paper we present a source to source compilation
approach with whole-program analysis to automatically transform single-threaded
FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized
kernels.
The main contributions of our work are: (1) whole-source refactoring to allow
any subroutine in the code to be offloaded to an accelerator. (2) Minimization
of the data transfer between the host and the accelerator by eliminating
redundant transfers. (3) Pragmatic auto-parallelization of the code to be
offloaded to the accelerator by identification of parallelizable maps and
reductions.
We have validated the code transformation performance of the compiler on the
NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy
Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow
water component of the ocean model Gmodel; the Linear Baroclinic Model, an
atmospheric climate model and Flexpart-WRF, a particle dispersion simulator.
The automatic parallelization component has been tested on as 2-D Shallow
Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and
produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated
versions of the 2DSW and the UFLES are resp. 9x and 20x faster on GPU than the
original code on CPU, in both cases this is the same performance as manually
ported code.Comment: 12 pages, 5 figures, submitted to "Computers and Fluids" as full
paper from ParCFD conference entr
A performance model of multicast communication in wormhole-routed networks on-chip
Collective communication operations form a part of overall traffic in most applications running on platforms employing direct interconnection networks. This paper presents a novel analytical model to compute communication latency of multicast as a widely used collective communication operation. The novelty of the model lies in its ability to predict the latency of the multicast communication in wormhole-routed architectures employing asynchronous multi-port routers scheme. The model is applied to the Quarc NoC and its validity is verified by comparing the model predictions against the results obtained from a discrete-event simulator developed using OMNET++
The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition
We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible
framework for parallel task-composition based many-core programming. We allow
the programmer to structure programs into task code, written as C++ classes,
and communication code, written in a restricted subset of C++ with functional
semantics and parallel evaluation. In this paper we discuss the GPRM, the
virtual machine framework that enables the parallel task composition approach.
We focus the discussion on GPIR, the functional language used as the
intermediate representation of the bytecode running on the GPRM. Using examples
in this language we show the flexibility and power of our task composition
framework. We demonstrate the potential using an implementation of a merge sort
algorithm on a 64-core Tilera processor, as well as on a conventional Intel
quad-core processor and an AMD 48-core processor system. We also compare our
framework with OpenMP tasks in a parallel pointer chasing algorithm running on
the Tilera processor. Our results show that the GPRM programs outperform the
corresponding OpenMP codes on all test platforms, and can greatly facilitate
writing of parallel programs, in particular non-data parallel algorithms such
as reductions.Comment: In Proceedings PLACES 2013, arXiv:1312.221
Cache-aware Parallel Programming for Manycore Processors
With rapidly evolving technology, multicore and manycore processors have
emerged as promising architectures to benefit from increasing transistor
numbers. The transition towards these parallel architectures makes today an
exciting time to investigate challenges in parallel computing. The TILEPro64 is
a manycore accelerator, composed of 64 tiles interconnected via multiple 8x8
mesh networks. It contains per-tile caches and supports cache-coherent shared
memory by default. In this paper we present a programming technique to take
advantages of distributed caching facilities in manycore processors. However,
unlike other work in this area, our approach does not use architecture-specific
libraries. Instead, we provide the programmer with a novel technique on how to
program future Non-Uniform Cache Architecture (NUCA) manycore systems, bearing
in mind their caching organisation. We show that our localised programming
approach can result in a significant improvement of the parallelisation
efficiency (speed-up).Comment: This work was presented at the international symposium on Highly-
Efficient Accelerators and Reconfigurable Technologies (HEART2013),
Edinburgh, Scotland, June 13-14, 201
An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors
The emergence of multicore and manycore processors is set to change the
parallel computing world. Applications are shifting towards increased
parallelism in order to utilise these architectures efficiently. This leads to
a situation where every application creates its desirable number of threads,
based on its parallel nature and the system resources allowance. Task
scheduling in such a multithreaded multiprogramming environment is a
significant challenge. In task scheduling, not only the order of the execution,
but also the mapping of threads to the execution resources is of a great
importance. In this paper we state and discuss some fundamental rules based on
results obtained from selected applications of the BOTS benchmarks on the
64-core TILEPro64 processor. We demonstrate how previously efficient mapping
policies such as those of the SMP Linux scheduler become inefficient when the
number of threads and cores grows. We propose a novel, low-overhead technique,
a heuristic based on the amount of time spent by each CPU doing some useful
work, to fairly distribute the workloads amongst the cores in a
multiprogramming environment. Our novel approach could be implemented as a
pragma similar to those in the new task-based OpenMP versions, or can be
incorporated as a distributed thread mapping mechanism in future manycore
programming frameworks. We show that our thread mapping scheme can outperform
the native GNU/Linux thread scheduler in both single-programming and
multiprogramming environments.Comment: ParCo Conference, Munich, Germany, 201
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Applications
Heterogeneous High-Performance Computing
(HPC) platforms present a significant programming challenge,
especially because the key users of HPC resources are scientists,
not parallel programmers. We contend that compiler technology
has to evolve to automatically create the best program variant
by transforming a given original program. We have developed a
novel methodology based on type transformations for generating
correct-by-construction design variants, and an associated
light-weight cost model for evaluating these variants for
implementation on FPGAs. In this paper we present a key
enabler of our approach, the cost model. We discuss how we
are able to quickly derive accurate estimates of performance
and resource-utilization from the design’s representation in our
intermediate language. We show results confirming the accuracy
of our cost model by testing it on three different scientific
kernels. We conclude with a case-study that compares a solution
generated by our framework with one from a conventional
high-level synthesis tool, showing better performance and
power-efficiency using our cost model based approach
An Intermediate Language and Estimator for Automated Design Space Exploration on FPGAs
We present the TyTra-IR, a new intermediate language intended as a
compilation target for high-level language compilers and a front-end for HDL
code generators. We develop the requirements of this new language based on the
design-space of FPGAs that it should be able to express and the
estimation-space in which each configuration from the design-space should be
mappable in an automated design flow. We use a simple kernel to illustrate
multiple configurations using the semantics of TyTra-IR. The key novelty of
this work is the cost model for resource-costs and throughput for different
configurations of interest for a particular kernel. Through the realistic
example of a Successive Over-Relaxation kernel implemented both in TyTra-IR and
HDL, we demonstrate both the expressiveness of the IR and the accuracy of our
cost model.Comment: Pre-print and extended version of poster paper accepted at
international symposium on Highly Efficient Accelerators and Reconfigurable
Technologies (HEART2015) Boston, MA, USA, June 1-2, 201
The impact of traffic localisation on the performance of NoCs for very large manycore systems
The scaling of semiconductor technologies is leading to processors with increasing numbers of cores. The adoption of Networks-on-Chip (NoC) in manycore systems requires a shift in focus from computation to communication, as communication is fast becoming the dominant factor in processor performance. In large manycore systems, performance is predicated on the locality of communication. In this work, we investigate the performance of three NoC topologies for systems with thousands of processor cores under two types of localised traffic. We present latency and throughput results comparing fat quadtree, concentrated mesh and mesh topologies under different degrees of localisation. Our results, based on the ITRS physical data for 2023, show that the type and degree of localisation of traffic significantly affects the NoC performance, and that scale-invariant topologies perform worse than flat topologies
- …